aWe phylogeactic trees j^neralhr u!

fer any meani^l^-^^ 
of genome Drganixation (9). ^ 
A naive operational definitfan would 

mologous mtatfonswpa ComparfeorTof dSI l£l^ ,fied ^^"8 to their ho? J£*S *nfl«*T it rh e 

COG consults of Individual ortholwDua » orthoJogous groups (COGs). Each Z~ f"* 5 reaul o. especially when 

fctnra^-W a^ratscEitt SSjS^to&K 

tt^MiribiDtics. ^ pecw °or highly sigiufl-ant starieci-gj]^ J* 

*"3»» fern wdtfrk co^lS SnomL J*™ *^«h^ (20), fcLffS* 
ft becomes fe^ to ^ tj^SS ^jfr-Mtneth* similarity cutoff 

inyUofiiany orthnW™,, «iS=I™ 



!^ bacterial genomss 

sxpecoed c> grow exponentially fer at IrJr StunrfJ^ Ancestral gene. Such, a natural to ffi^J^ 0 * 01 * ^anda™ 

impact an biology will further increase tf) K ^ WollJl <°^ "«di» aid for rapid. ^ 95 *" ddl »"' rt f - 1 ' 

Knowing ^ tav rf^^j ^autom^Wional^SS 

M responsible fer hn^^mTfoTD ^JT"**} ^ fmme^oj 

*«nt phrenetic lineages li cenral m SSa? » ,Bftaa «P>"«»i.lth niiinlto 

of a Smi k cell. ComnlctB BemiEiicB^ fe^ P . BEnome5 ^ ^ small andea^ 
mispexaatie fer ^viag ^S a " ^0 J^ 7 T. 1 ^ fedi^idtX Here 

Pierc networfc of rektioiishiiu between • r»-*t. , 

c^v^^^^^^S C^t^f^J 3 ^ 10 ^ Deriving 
pliwf tt ^^fS^| a^-fOrthotagou.aoupf 

not encoded in a grven ^noa*. Acce S g 0 f^^f to , m ^ ^^"J as J2 D& ri^ Bd 0,1 

ty. to aitaaarive for ^ ^™"^f u uTT f 1 ^logDi* fe^g, ^TVq ^J 8018 » *t graph of BeTj. The 

evojvea rroaj a common ancestral mm, v. ^ Indeed, if a gene A*™ 

Normally s crchobp ^ th^^O ^J^^ Sf^ ^B^S^ 
« the couae of 9 voli^ wh SL ^'f^f^.'Hey are ban* fide ortho ^ 

iatfid eo riie origina] cna. TW n , tansies does nac dtoeM ^ wu 

d«cao n of gene fmctians frTnftX »n ""P 3 ™ P«««n« and chus allows Ab Sift 



Z'" a «™«nve PTDttin for the MBM«ivV 
Aiacoon should be eought among dSftSo 

multiple genoroe wjuen^ ft isp^ S 
delineate proam family C^l? 
«»»eTv«3 to one domain ^liTw^ 



r '"^ »e task of idendiyinB art) 
e ^°f 95 ^ delation of Erf 
orthologou, groups CCOQs). Ea^CDC? 
™ ^individual Ortho^r^S 

S ^7 1 ° 6Sn ^ ,c ^ oAeruwds, 

2m ^ Afferent line*™ 

Z! ■ J?J ^ aauitted to evolved from 

«! J** 1 *"* comparuoM among the 

Ptete genomes were performed (11) JiZT 
m* M, the heft hit (BeT)^^ 

P^the^^^S 

end most imrtortatir *f 
BeTs forming a triangle com- from widely
different lineages. Accordingly, only five
major, phylogenetically distant clades were
used as independent contributors to COGs:
Gram-negative bacteria (Escherichia coli and
H. influenzae), Gram-positive bacteria (My-



paralogous σ subunits of E. coli RNA polymerase,
but only for one of them, RpoD, is
the relationship symmetrical.

Most of the clusters derived by the above
procedure meet the definition of a COG



So hS?* - fF,<n * i) (5afltW ^ cei 

The procedure used to derive COGs included
finding all triangles formed by BeTs
between the five major clades and merging
those triangles that had a common side
until no new ones could be joined. A triangle
is an elementary, minimal COG (Fig. 1A).
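The triangle-merging procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `clade_of` and `bets` inputs, and the toy gene labels are all hypothetical, and the sketch treats a pair of genes as linked when a BeT exists in at least one direction, which is a simplifying assumption.

```python
from itertools import combinations


def find_triangles(clade_of, bets):
    """Return triangles: three genes from three different clades in which
    every pair is linked by a BeT in at least one direction (assumption)."""
    linked = set()
    for query, hits in bets.items():
        for hit in hits:
            if clade_of[query] != clade_of[hit]:
                linked.add(frozenset((query, hit)))
    triangles = []
    for trio in combinations(sorted(clade_of), 3):
        if len({clade_of[g] for g in trio}) < 3:
            continue  # all three genes must come from different clades
        if all(frozenset(pair) in linked for pair in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two genes) until nothing joins."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    side_owner = {}
    for idx, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(idx)] = find(side_owner[side])
            else:
                side_owner[side] = idx
    groups = {}
    for idx, tri in enumerate(triangles):
        groups.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: gene -> clade, and gene -> set of best hits.
clade_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "D",
            "a2": "A", "b2": "B", "c2": "C"}
bets = {"a1": {"b1", "c1", "d1"}, "b1": {"a1", "c1", "d1"},
        "c1": {"a1"}, "d1": {"a1"},
        "a2": {"b2", "c2"}, "b2": {"c2"}}
cogs = merge_triangles(find_triangles(clade_of, bets))
```

In the toy data, the triangles a1-b1-c1 and a1-b1-d1 share the side a1-b1 and so collapse into one group, while a2-b2-c2 stays separate. A union-find structure keyed on triangle sides is one convenient way to express "merge until no new ones can be joined"; other graph traversals would work equally well.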

The groups produced by merging adjacent
triangles include orthologs from different
lineages and, in many cases, paralogs
from the same lineage (Fig. 1, B and C).
Because of the existence of paralogs, the
BeTs that form the triangles are not necessarily
symmetrical. For example, in the
COG shown in Fig. 1C, the same M. genitalium
protein, MG249, is the BeT for four



Examples of COGs. Solid lines show sym-

525; corrafipondlnB 1c the species tor 

which the BeT is observed. Genes from the same
species are adjacent; otherwise the gene names
are positioned arbitrarily. A unique COG ID is
indicated in the upper left corner. (A) Congruent BeTs
form a triangle, a minimal COG. Origin of the

£. ancf YKROSSc, S. cemvtsla*. Note that el the 
BeTe are symmetrical. (B) A slrnpte COG with two 
^•pajHtogs. Ortrjlnoflhe proteins: ltes.5 cot. 

Wftyaraaa; MG345, M. ga^jw 
M ^p22- M PrwifflDrtae; MJO947. M.Janmsm- 

the adjacent ujanalaa with B common Etta far 
exarnpte. IIBS-MG345-MJ0B47 and B »1SB2- 
MG345^Uia52.-VHjD40e s the' yeest rnto- 
■ awntmel fecfeticyMRNA BynthBtase; the bacterial 
o*olo» and that from M. JannaschS areTtrHs 
WTsforthfc yeest pitrtata, but the reversals true 
ony of the bacteria protalnB frryrrvnetriofll BaTs) 
ConvarsBry, fur YBLD7Bc. which is the yeest oyto- 
Ptonlc jroteucyf-tRMA (synthetase, the M. jann- 
f"™ ° r } t F to 3 a symmetrical BaT, whereas the 
SS^^L 4 ™ H ^ mm9trical ' (C) A complex 
v^mulhpiQ paralnae. Orfein of the proteins: 

and HTN16S& H. tntugnzae; MG349. At psnf- 
afijn; MP4BS, M. pmeumonte: 6B01B4 sBMa 

Oste 6P..F1P0D. HIN1655. sJrTJ65a, end MG2<9 

note the fully symmetrical
relationships between these proteins. The other
proteins are specialized sigma factors whose deviation
from the ancestral family apparently was
accompanied by modification of the function and
accelerated evolution; note the



sons why, in certain cases, COGs may be
lumped together. Proteins may contain two
or more distinct regions, each of which
belongs to a different conserved family; usually
such proteins are loosely referred to as
multidomain (M). Each of the COGs was
inspected for the presence of multidomain
proteins, individual domains were isolated,
and a second iteration of the sequence
comparison was performed with the resulting
database of domains. Some of the COGs
may include proteins from different lineages
that are paralogs rather than orthologs, primarily
because of differential gene loss in
the major phylogenetic lineages. When one
gene in a pair of paralogs is lost in one
Tf, not ^ ^ two COOs 

that should have been distinct may be artificially
joined. Therefore, the level of sequence
similarity between the members of
each COG was analyzed, and clusters that
seemed to contain two or more COGs were

Phylogenetic and Functional
Patterns in COGs

The described analysis resulted in 710 apparent
COGs. This set appears to be essentially
complete as far as orthologous relationships
are concerned. Indeed, when the
portion of the database of protein sequences from
complete genomes not included in the
COGs was clustered by sequence similarity
(16), only 10 groups were identified, which,
upon careful inspection of the alignments,
were considered likely to constitute additional
COGs missed originally. These
groups were incorporated, producing the final
collection of 720 COGs, including 6814
proteins and distinct domains of multidomain
proteins (6646 distinct gene products,
or 37% of the total number of genes in the
seven complete genomes) (17).

Most of the COGs are relatively small
groups of proteins. One-third of the COGs
(240 COGs with 1406 proteins) contain
one representative of each of the included
species (no paralogs), and 102 more COGs
include paralogs from only one species,
most frequently yeast (87 COGs). The
mean number of proteins per COG increases
with increasing number of genes in a
genome, from 1.2 for M. genitalium to 2^
for yeast. A notable aspect of many COGs is
the differential behavior of paralogs. It is
typical that one of the paralogs, for example,
in yeast, shows consistently higher similarity
to the orthologs in all or most of the
other species (Fig. 1, B and C). For numerous
yeast paralogs, particularly components
of the translation apparatus, the underlying
cause is obvious: the gene whose product is
most similar to the bacterial orthologs is of
mitochondrial origin (Fig. 1B). A more
common explanation for the asymmetry of
the relationships in the COGs, however, is
that the highly conserved paralog has retained
the original function, whereas the
functions of the less conserved paralogs
have changed in the course of evolution. In
the already considered example (Fig. 1C),
the symmetrical component of the graph
(solid lines) delineates the conserved function
of the σ70 subunit of the RNA polymerase
(E. coli RpoD), which is required for
the transcription of the bulk of bacterial
genes, whereas the asymmetrical BeTs (broken
lines) are seen for the σ subunits (E.
coli RpoH, RpoS, and FliA) involved in the
transcription of specialized gene subsets
(18). This phenomenon appears to be
widespread, as we found 549 proteins in 502
ffltnecred 1 raMnc ? m a COG with E. coif or Synae^ocwT



istttsstRSS r^sasssa 



^^^ap^tabedt* primarily 

(J0J. although rhe mentioned bias to * e 
eenome collection is also a facto:. 

The phybfitnerk distribution of the 



" — wuetgnur rrnm withta the 

conserved ones. Hie COGs wffl be an imO dons. TkL "ST* """""i"^ timet 
porant resource in a arsons su™ KZ" 68 °f F*hogen* barter 

coriwvd gene families. panUOe5 m cftfae two larger bacterial 

Thaic are asveraj w d , j^scmmJtmcnts, £. cob and Synec*cquaj 

current collection with corncle* reJ»H ft 7n Zh; . ^ *P ecifis almost always 

ships between aemb^ T^Trf X" ' f* the COGs. The main cauTrf 7^ 

Pa*c) components of ABC mmeaeaiaBd rf baaeO fcaal daW?^w7 ■ 

histidtns kin&BEs, each iaciudT^J irtn £j* ae *. m ^parasitic species from dlfQ diMte^n J v " ? 18 1101 '"""P"^ 

"tend,*,. l t fe lfcly^T^^T" S ^ m^.cjades. A=c<Xgly,TrS SESS ^ZS"*** « which 

called analysis of rL^U^T^ ,? U 434 P 1 ™**™ fom the pathomJe h*=tSa "«Wjs COGs bib predominant Another 

Sfft^^^^S ^«~*^i-i^«£r SSTL'SS&ats^ 

, , l, ^"^V when more genomes are p««j*h; in many COGj 

available. On a more general note, COGs
do not supplant traditional methods of phylogenetic
analysis but rather provide the
appropriate starting material for these
methods, in particular for systematic analysis
of phylogenetic tree topology.

COGs by broadly defined function (29) and
by species (20). For the majority of the
COGs, the protein function is either known
from direct experiments, mainly in E. coli or
yeast, or can be confidently inferred on the
basis of significant sequence similarity to
functionally characterized proteins from
other species. It has to be emphasized that
construction of the COGs includes automatic
prediction of the function for numerous
genes, particularly from the poorly characterized
genomes such as M. jannaschii.
There is, however, a substantial fraction of
the COGs (14%) for which only a
functional prediction, typically of biochemical
activity, but not the actual cellular role,
could be made, and for another 5%, there
is no functional clue (Fig. 3). Each of the
COGs includes proteins from at least three
major clades whose divergence time is estimated
to be over a billion years (21), that
is, they all are ancient, conserved families
with important, if not necessarily essential,
cellular functions. Therefore, the proteins
belonging to the "mysterious" COGs are
fttdtT"^^ fOT dirficted «^rimenal 

The distribution of proteins from different
species in the COGs shows several
trends (Fig. 2), although the bias in the
current collection of complete genomes (in
particular, because three lineages are required
to form a COG, all COGs had to
have a bacterial member) must be taken
into account when interpreting these comparisons.
The fraction of proteins belonging
to COGs is greatest in the nearly minimal
genomes of mycoplasmas (70% for M. genitalium







belong to me particular COS. The n^ban 5qm21 °L^° ra P 8r ^Ss from the given bd^ 
funeaonaj categories (ueed In me COGJ™ P lettars h ^ leftrr»st field encodetfie 
in each functional category other than translation
and transcription, but especially in the
metabolic functional classes. Conversely, the
congruent between the two ncjiparasitic 

ttiith > different*! appears of£ 
potems due tend to group whh yeast oroO 
f» » translate ^ a^gT 
classes (which, given** bias in df, 
collection, results fa ubiquitous COGs) but 
» »D otheAnctteaJ classes are frequently 

Thephybgeneox distribution d COO 

of PW**«ie pattern,,* which snow 
. ?IP se ?f w awsnce of each anataed sneO 

least three lineages (the definition of a COG),
36 were actually found. Missing were mostly
patterns with only one of the two species of
Mycoplasma, which was predictable because
the gene complement of M. genitalium is essentially
a subset of the M. pneumoniae complement
(2). The remaining eight patterns
that were never observed all include pathogenic
bacteria without E. coli, which is the
largest and most diverse of the available bacterial
genomes. The two most abundant patterns
could easily be predicted: all species
("ehgpcmy"), and all species except for the
mycoplasmas ("eh__cmy"). What appears
much less trivial is that these patterns together
encompass only one-third of all COGs.
This fact emphasizes the remarkable fluidity
of genomes in evolution, revealed in spite of
the fact that the analysis concentrated on
ancient conserved families. Multiple solutions
for the same important cellular function appear
to be a rule rather than an exception, if
phylogenetically distant species are
considered (10, 23). On the other hand, the
eight most frequent patterns, which together

SLTJ* B £ rf ^ ^ 811 *SueV 

with E. coli and Synechocystis, emphasizing the
congruency between these genomes.
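A phylogenetic pattern of the kind discussed above is simply a fixed-order string with one position per genome. The sketch below shows how such strings could be produced and tallied. The one-letter codes and their order are inferred from the pattern strings that appear in the text and tables (for example "eh_pc_y" and "e_gpcmy"); the mapping to species given in the comments and the function names are assumptions made for illustration.

```python
from collections import Counter

# Assumed order and meaning of the one-letter species codes:
# e = E. coli, h = H. influenzae, g = M. genitalium, p = M. pneumoniae,
# c = Synechocystis, m = M. jannaschii, y = yeast.
SPECIES_ORDER = "ehgpcmy"


def phylogenetic_pattern(species_present, order=SPECIES_ORDER):
    """Encode presence as the species letter and absence as an underscore."""
    return "".join(s if s in species_present else "_" for s in order)


def count_patterns(cog_species):
    """Tally patterns over a mapping of COG id -> set of species letters."""
    return Counter(phylogenetic_pattern(sp) for sp in cog_species.values())


# Hypothetical COGs, used only to show the encoding.
example = {
    "COG_x": set("ehgpcmy"),   # present in all seven genomes
    "COG_y": set("ehcmy"),     # absent from both mycoplasmas
    "COG_z": set("ehcmy"),
}
tally = count_patterns(example)
```

Counting patterns this way makes statements such as "the two most abundant patterns account for one-third of the COGs" straightforward to check against a real COG table.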



(n A 14 ^^CCOs, most of then 
cotoponents of the tcaudateTand 
«^tta machbety, rorn, the uruversal 
c^rflrft This, set is more than twofold 
flown nooi the bacterial. ''imnjmal «• ' u^O 
genes (23), butsignhW 

brt^^m rfootapared genomes. 

S'^J^^Jomains of life, wMi 
«% «rf the COGs including repUsenO 
^vesofBacteria, A^ea, andS^* 
another manifestation of the dynamics of
gene families in evolution (Fig. 3). The
picture is expected to become even more

COGs will probably drop, once archaeal-eukaryotic
COGs emerge with the accumulation
of genome sequences.

The unusual, rare patterns are of particular
interest, suggesting the possibility of
unexpected findings. Each of the COGs
with patterns that occur only once in our
current collection (Table 1) should correspond
to a unique function scattered over
of the tree of life. 
Why such functions are conserved and are
presumably important for survival in some
but not other lineages is a challenge to be
rfef ^if^^y- The principal 
^ ^ «n be InO 
voked to explain the cmergenc, of these 
differential gene loss and 
trarafer of genes. Some of the 
Masons wvolved, for examole, lipoateO- 
SJlf ^ ^^^^ ribciucleO 

umelated to one anorha (24). Other fS 

d^^l^ 5tUn3 ? aIt dehydrog«n a ^ s 7 aa 7 be 
dispensable under most conditions, and «0 
^Jg j^tiHlgeneloss^H,,^ 
temarleable, however, that these ruactioas 



are preserved in the nearly minimal gene
complements of the mycoplasmas. Two of
the unique patterns, namely "__gpc_y" and
"-hgJUy," might have evolved through
horizontal transfer of typical eukaryotic
genes into bacterial genomes. The latter
SS" ° f ^ tiailaT M ^ ^Ives 

be^rffc )a f fl *! 1 gene ^ » » numO 
oer of fcaasaal pathogens and impu'eated in 
pathogenicity (25). Two of me COGs whh 

include highly conserved but uncharacterized
proteins whose functions could be predicted
only by detailed analysis of conserved
protein motifs (Table 1). These examples
demonstrate the potential for pro-

T^J^^J^^ 011 to the 

ccmsnuction of the COGs themselves. 

BsmaU and based, and when a m«e « m 0 ■ 

OTOs by phylogeaetic patcetns is likeh to 

conunon when larger genomes 60m the ' 

hZ^ZS^"*® ' liMa ^ as 
become available. Nevertheless,
we believe that the language of
phylogenetic patterns will become

^ .top** of relaelonO 
amps bersreen niniciple genomes. 

Connecting and
Expanding the COGs



coo. _ 2 ae to . ,000." j^,^ 
Ancient families of paralogs that span a
broad range of taxa are well known (26).
Accordingly, a number of COGs are related
to each other and can be connected into
superfamilies. In order to elucidate the superfamily
structure of the COG collection,
we used the recently developed PSI-BLAST
(position-specific iterative BLAST) pro-

profile analysis (27). Two COGs were considered
connected if at least two
proteins from the first COG hit members of
the second COG in the PSI-BLAST search
and vice versa. Clustering by this criterion

Compared to COGs themselves, the superfamilies
are a higher level of protein
classification. Typically, they include

ever, may be required for a variety of celluO 
^fimaions. For «tet^. d«5i^ 

^ I ?^J C TT t 53 0005 with 8«3 proO 
terns, all of which contain conserved metifi 
tYP'cal ©f ATPases end GTPases but a™ 

gNA repbetion to metabolite transport 
Superfamilies and their signature motifs
will be useful in classifying proteins that
have evolved to an extent that they cannot
be assigned to any COG but still
retain a conserved motif. We sought to
detect such proteins with distant, subtle
similarity to COGs that might be encoded
in the analyzed genomes. The PSI-BLAST
analysis (27) detected "tails" of distantly
related proteins (a total of 3586) for 321
COGs, increasing the total number of proteins
connected to COGs to 10,332 (58%
of the entire protein set from complete
genomes).
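The connection rule for superfamilies given earlier in this section (two COGs are linked when proteins from each one hit members of the other in the PSI-BLAST search) lends itself to a short sketch. The code below is an illustration under stated assumptions, not the published procedure: the threshold of two proteins is read from a partly illegible passage, the input structures `cog_members` and `psi_hits` are hypothetical, and grouping linked COGs by connected components (single linkage) is an assumed clustering choice.

```python
def cogs_connected(a, b, cog_members, psi_hits, min_proteins=2):
    """True if at least `min_proteins` proteins of each COG hit the other."""
    def n_hitting(src, dst):
        return sum(
            1 for p in cog_members[src]
            if psi_hits.get(p, set()) & cog_members[dst]
        )
    return (n_hitting(a, b) >= min_proteins
            and n_hitting(b, a) >= min_proteins)


def superfamilies(cog_members, psi_hits, min_proteins=2):
    """Group COGs into connected components of the 'connected' relation."""
    cogs = sorted(cog_members)
    seen, families = set(), []
    for start in cogs:
        if start in seen:
            continue
        seen.add(start)
        stack, family = [start], set()
        while stack:
            cur = stack.pop()
            family.add(cur)
            for other in cogs:
                if other not in seen and cogs_connected(
                        cur, other, cog_members, psi_hits, min_proteins):
                    seen.add(other)
                    stack.append(other)
        families.append(sorted(family))
    return families


# Hypothetical toy input: COG id -> member proteins, protein -> proteins hit.
cog_members = {"X": {"x1", "x2"}, "Y": {"y1", "y2"}, "Z": {"z1", "z2"}}
psi_hits = {"x1": {"y1"}, "x2": {"y2"},
            "y1": {"x1"}, "y2": {"x1"},
            "z1": {"x1"}}
families = superfamilies(cog_members, psi_hits)
```

Requiring hits in both directions keeps a single promiscuous protein (such as z1 in the toy data) from pulling an unrelated COG into a superfamily.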

Because apparent orthologs from at least
three major clades were required to form a
COG, there are potential new COGs hidden
among the results of the comparison of
protein sequences from complete genomes
(21). Clustering by sequence similarity the
proteins not included in COGs (J-f ) resulted
in 443 groups with members from two
clades. Predictably, the greatest number,
ffft ye» ™m the eyBjiobaetcxial and 
Gmxnlas^tive dad.es, followed by 67 
groups combining yeast and jamasduL 



Many of these groups are likely to become
COGs once additional genomes are included
in the analysis.



Prediction of Protein Functions 
with the COG System 



The COG system allows automatic functional
and phylogenetic annotation of
genes and gene sets (29). As in the procedure
used for the construction of the COGs,
the criterion for adding likely orthologs
from other genomes to the COGs is based
on the consistency between the observed
relationships. A protein is compared to the
database of protein sequences from complete
genomes and is included in a
COG if at least two BeTs fall into it. Given
that the COGs were constructed from proteins
encoded in complete genomes, it is
not a requirement that newly included proteins
also originate from a complete genome.
Indeed, while the unsequenced portion
of a genome may encode proteins with
the highest similarity to those included in
COGs, the BeTs will not change for the
products of already sequenced genes.
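The assignment rule stated above (a protein joins a COG if at least two of its BeTs fall into that COG) can be written down compactly. The following is a minimal sketch under assumptions: the function name, the `bets_by_genome` and `cog_of` inputs, and the tie-breaking by COG identifier are all hypothetical choices made for illustration.

```python
from collections import Counter


def assign_to_cog(bets_by_genome, cog_of, min_bets=2):
    """Return the COG receiving at least `min_bets` of the query's BeTs.

    bets_by_genome: genome name -> id of the query's best hit in that genome.
    cog_of: protein id -> COG id (proteins outside any COG are absent).
    Returns None when no COG collects enough BeTs.
    """
    counts = Counter(
        cog_of[hit] for hit in bets_by_genome.values() if hit in cog_of
    )
    if not counts:
        return None
    # Ties are broken by COG id order; this is an arbitrary choice.
    cog, n = max(sorted(counts.items()), key=lambda kv: kv[1])
    return cog if n >= min_bets else None


# Hypothetical example: two of three BeTs fall into the same COG.
cog_of = {"p1": "COG_A", "p2": "COG_A", "p3": "COG_B"}
bets = {"E. coli": "p1", "yeast": "p2", "Synechocystis": "p3"}
assigned = assign_to_cog(bets, cog_of)
```

Because each BeT is computed against one already complete genome, the result for a given query does not depend on whether the query's own genome is fully sequenced, which matches the point made in the paragraph above.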

As a demonstration of the principle
coupled with additional characterization
of the COGs themselves, the sequences of
proteins with known three-dimensional
structures from the PDB database (30)
were compared to the protein sequences
encoded in complete genomes. This
procedure resulted in proteins with
known three-dimensional structure being
included in 183 COGs, of which one was
shown to be a false positive by subsequent
alignment analysis. Thus, structural information
could be inferred for at least 25%
of the COGs. In most cases, the structurally
characterized protein (from E. coli or
yeast) actually belongs to a COG or is a
closely related homolog of the proteins
forming a COG.

Some of the predictions, however, provide
significant functional and structural
inferences. Of particular interest are (i)
the possibility of modeling the nuclease
domain of polyadenylate cleavage factors
Pattern and
COG ID



Proteins



Activity or function



COG0213F 
COGQ246G 



COG0095H 



eh_pc_y 
COG0604R 



0030676* 

COG0B31R 

—SP^rny 
COGCM23J 



COGD622R 



erupomy 
COG007SE 

COGQ510M* 



Comment



DeoA-MQOBI -MP0B0- 

MJ03B7 
MtfD, IfcaB, UwB, Ycm, 

Ysto-MPI 90-V£|j0710w. 

YNR073c 

LptA-MG270-MP450- 
<B(IOB09)-YJLD4&hr 



protH>is-MP27B-sttO990, 
sin 1 B2-YBR046e + 19 
yeast proteins ■ 

YLRlOSw 
MG1D8-MP586-sfl1771- 
• sH1O53-bUO602-YDL0O9w 

+ e ysaat proteins 
MG251-MP4S3-MJD22B- 

YPROaic, YBR121C 



C2300-MG2D7, 
MJ0936-YHR012W 

At^I, ArgF, 

YgaV^HN0Cn2^P531- 

BGD9G2-MJ08B1-YJL0S6W 
HJW093&-MQt35e, * 

MP31MTO147W, 

YLR133W



Thymidine phosphorylase;
salvage of deoxypyrimidines
Mannitol-1-phosphate and other hexuronate dehydrogenases;
hexuronate catabolism
Lipoate-protein ligase A; ligation of lipoate to apoproteins of
pyruvate dehydrogenase and other lipoate-dependent enzymes
Alcohol dehydrogenase class III and related dehydrogenases;
various catabolic pathways
Glutaredoxin-like membrane protein (prediction)
Protein serine and threonine phosphatase
Glycyl-tRNA synthetase (eukaryotic and Gram-positive
Phosphoesterase (prediction)
Ornithine carbamoyltransferase; arginine biosynthesis
Choline kinase (prediction) involved in lipopolysaccharide
biosynthesis



^ffi^f^ £ ^ epparBnt °^° [ ^ **** h 
cBrtsonytfete metabdlem pfi), 

Uibib are hvo unrelated classes of Bpoate-protafh teas bet 
|P^WESt eftooda both farms; H. 

ftfra p7]. whWi was notautomstioaUy Indudecdntha 
COC3 but was cfatocted wfti PSI-Sl_AOT. 

"The H. //Starese proisln eontelns an addlttonal 
ttuoradcDtfWIte domain. 

S ^ r f?" 1, ^ eDrtne Prateh phoartiatBses are afiuneant 
m euksryotes out not in bacteria {30}, 

^c^HRNA that eppears to ba unrelated » the 
^^^ctfmd/iWtoHelspralyi.ffli* 
"^JSS?"^ K°L 9ln ^trtst shares orty morjasd 



E T^««' bacteria) pathogans and 
(31) with the beta-lactamase structure,
(ii) the presence of an acylphosphatase
domain in hydrogenase expression factors,
which form a highly conserved COG, and
in a number of uncharacterized proteins,
and (iii) the connection between a unique
carbonic anhydrase and an acetyltransferase
family (Table 2).

Probably the most important application
of the COGs is functional characterization
of newly sequenced genomes. In
the preliminary analysis of the recently
published genome of the major human
bacterial pathogen Helicobacter pylori (32),
813 proteins (51% of the gene products)
from this bacterium were included in 453
preexisting COGs and 143 new COGs
(33). In spite of the fact that many H.
pylori proteins are highly similar to homologs
from E. coli and other bacteria and
have been explored in detail (32), this
analysis produced over 100 additional
functional predictions (33).

Conclusions and Perspective 

The COGs bring together the fields of
comparative genomics and protein classification.
Among the numerous possible
approaches to protein classification, the
COGs appear to be unique as a prototype
of a natural system, which has as its basic
unit a group of descendants of a single
ancestral gene. Typically, such a group is
associated with a conserved, specific function,
so that the inclusion of a protein in
a COG automatically entails functional
prediction.

Each COG contains conserved genes
from at least three phylogenetically distant
clades and, accordingly, corresponds
to an ancient conserved region (ACR).
Previous analyses have indicated that the
total number of distinct ACRs is likely to
be less than 1000 (34). Thus, even with
the limited number of complete genomes
currently available for analysis, the COGs
have already captured a substantial fraction
of all existing highly conserved protein
domains. With more genomes included
in the system, the discovery of additional
COGs should gradually level off,
with the great majority of the ACRs encoded
in the added genomes fitting into
already known COGs.

With the forthcoming flood of
sequences, a coherent framework for understanding
these genomes from both the functional
and evolutionary viewpoints is a
must. We regard the current collection of COGs



Table 2. Structural and functional predictions for uncharacterized proteins in COGs.



Phylogenetic
pattern and
COG ID*



Proteins in COG†



Activity and
function



Homolog in PDB‡
BeTs detected (no.)
Lowest P with a COG
member



Comment



e_gpcmy 
. COGD595R 



Bh^omy 
COG0607R 



etigpcLy 
COG0596R 



O0G006BC 



RhnP, 

VLB&77C, YWlRl37e, 
. YKR07BO 



SseA, PspE, GtpE, 
YbN, YbbB, YnjE, 
YoaP-2h-5c-MJD0S&ay 



PldB, MhpC, YcdJ, 

MGD20MP132-5C- 
YNR0S4C, YKL094W 

HypFsnoa22-MJ0713 



Predicted
Zn-dependent
hydrolases



Predicted
sulfurtransferases



Predicted
hydrolases and
acyltransferases



Hydrogenase
maturation
factor



Beta-iactErnaaD 
flBMC) 
•2 ■ 
-0.03B 



RhoetenesB DRHD, 
20RA, 10RS) 
-2 



pup, 
1TAH[B, icvl). 
-3 

(1APS) 
■2 X10" B 



COG0663R 



CaiE, Yrd^YdbZ-sinese. 
SB1031-MJO3O4 



Predicted
carbonic
anhydrases



Carbonic anhydrase
from
Methanosarcina
thermophila (1THJ)
•0




Activity is not known for any protein in this
ubiquitous COG. Biochemical and genetic
data indicate that YLR277c is involved in
messenger RNA 3'-end processing (37),
whereas YMR137c is DNA cross-link repair
protein SNM1 (38). A motif including the
Zn-coordinating histidines of beta-lactamase
is conserved.
The sulfurtransferase activity of SseA has been
demonstrated (40), but the rest of the
proteins in this COG have no known activity.
PspE (phage shock protein), GlpE
(uncharacterized protein involved in glycerol
metabolism), and other small proteins
correspond to one of the two rhodanese
domains.

PldB is known to possess triglyceride lipase
activity (41). All other proteins in the COG
have not been characterized but now can be
predicted to possess the α/β-hydrolase
fold.

HypF is required for hydrogenase biosynthesis
(42), but no biochemical activity is known. The
~100 amino acid, NH2-terminal domain
aligns with acylphosphatase, with the catalytic
residues conserved, suggesting that HypF
orthologs indeed possess acylphosphatase
activity. A PSI-BLAST search with this domain
as the query detected five additional likely
acylphoaphataees, nameiy ScctfYccXand 
M r jarmBSchfl MJQS09, MJOSSS, MJ1331, 
and MJ1405 m. 

The biochemical activity of the proteins in this
COG is not known. They show not only
conservation of histidine residues comprising
the active center of this unusual carbonic
anhydrase (44) but also significant similarity to
acetyltransferases of the isoleucine patch
superfamily (45), suggesting an unexpected
connection between the two types of
enzymes.



*The designations are as in Table 1 and Fig. 3.

aroBBatan is mcicateo in pamrnnescs. 



†2g indicates two proteins from M. genitalium, 2p indicates two proteins from M. pneumoniae, and so forth. ‡The PDB
as a crude first version of such a
framework. Inclusion of additional, phylogenetically
diverse genomes and further development
of the procedures used to derive
and analyze COGs will hopefully result in
refinement of this system, making it a solid
platform for genome annotation and evolutionary
genomics.